ggml-cpu: arm64: q4_K repack gemm and gemv implementations (i8mm) #16739

ggerganov merged 15 commits into ggml-org:master
Conversation
@ggerganov is there something else needed from my side, or are we waiting on another review?
There seems to be a bug somewhere. Here is a repro on M4 Max:

../scripts/get-wikitext-2.sh
make -j && ./bin/llama-perplexity -hf LiquidAI/LFM2-2.6B-GGUF:Q4_K_M -f ./wikitext-2-raw/wiki.test.raw -dev none
...
# PPL sky-rockets:
0.01.007.961 I perplexity: calculating perplexity over 581 chunks, n_ctx=512, batch_size=2048, n_seq=4
0.05.476.977 I perplexity: 4.47 seconds per pass - ETA 10.82 minutes
[1]6.8941,[2]1485.3563,[3]8468.4132,[4]21269.3291,[5]4800.3655,[6]9365.2385,[7]15453.2190,[8]22744.0153,^C
I was able to replicate the PPL skyrocketing with the generic implementation as well, so I'll try to figure out what is going on.

Edit: it also happens with Q4_0 repack. Interestingly, it happens from the second chunk onwards. I'll try to run on an AVX machine to see if it's something unrelated to the GEMMs themselves. I also compared the tensor outputs of all mul mats for a couple of llama-eval-callback runs and the results were nearly identical, except for a 0.0001 deviation here and there. What I don't understand is how I was able to run the PPL with LFM correctly earlier; I may have messed up GGML_CPU_REPACK in the build, sorry about that.
Hm yes -
I've opened #17030 for the fix. As you pointed out, LFM2 had some MAT_MUL layers with a (6144, 256, 2, 1) tensor, where only the first 6144*256 elements were multiplied.
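To illustrate the failure mode described above, here is a hypothetical sketch (not the actual ggml kernel, and the names are made up for this example): with a weight tensor that has a batch dimension of 2, a kernel that only walks the first 2D plane leaves the second batch's outputs at zero, which is consistent with the perplexity exploding after the first chunk.

```cpp
#include <vector>

// Hypothetical sketch of the shape bug: y[b] = W[b] * x for each batch
// slice b. Passing batches_computed == 1 models the buggy path that only
// multiplies the first 2D plane; batches_computed == 2 models the fix.
// All-ones weights and input make the expected per-row result equal `cols`.
std::vector<float> matmul_batches(int rows, int cols, int batches_computed) {
    const int batch = 2;
    std::vector<float> w(rows * cols * batch, 1.0f); // all-ones weights
    std::vector<float> x(cols, 1.0f);                // all-ones input
    std::vector<float> y(rows * batch, 0.0f);        // outputs start at zero

    for (int b = 0; b < batches_computed; ++b) {     // buggy: 1, fixed: 2
        for (int r = 0; r < rows; ++r) {
            float acc = 0.0f;
            for (int c = 0; c < cols; ++c) {
                acc += w[(b * rows + r) * cols + c] * x[c];
            }
            y[b * rows + r] = acc;
        }
    }
    return y;
}
```

With `rows = 4, cols = 3`, the buggy variant leaves `y[4]` (the first row of the second batch) at 0 while the fixed variant writes 3 there.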
Signed-off-by: Alberto Cabrera <[email protected]>
Force-pushed from 7081eda to 8a2fd93
@ggerganov #17241 fixed the perplexity issues, so this PR is again ready for review (it's rebased on top of master).
@ggerganov sorry for pinging again! I don't have merge rights. Could you please merge it?
It's pending review by @slaren
Ah, sorry for the misunderstanding! I got another PR merged with a single review and didn't realize both approvals were needed here. Thanks!
    UNUSED(s);
    UNUSED(bs);
    UNUSED(vx);
    UNUSED(vy);
    UNUSED(nr);
    UNUSED(nc);
I don't think these are necessary.
Fixed, thanks for catching that. Do you want me to remove them from the other implementations that call the generic one as well? They have the same problem.
    UNUSED(ncols_interleaved);
    UNUSED(blocklen);
    #if !((defined(_MSC_VER)) && !defined(__clang__)) && defined(__aarch64__) && defined(__ARM_NEON)
What's the reason for excluding MSVC here? There is no inline assembly.
No reason; I copied the q4_0 gemv guards and worked from there. But as you've pointed out, seeing the implementations in quants.c, it seems safe to remove !((defined(_MSC_VER)) && !defined(__clang__)). Do you want me to also remove the guards from ggml_gemv_q4_0_4x8_q8_0, ggml_gemv_iq4_nl_4x4_q8_0, ggml_gemv_q4_0_4x4_q8_0 and ggml_gemm_iq4_nl_4x4_q8_0? Those don't have inline assembly either.
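If the MSVC exclusion is dropped as discussed, the guard would reduce to the architecture and feature checks alone. A sketch of the resulting preprocessor structure (not the exact diff):

```cpp
// Before: also excluded non-clang MSVC, although there is no inline assembly:
// #if !((defined(_MSC_VER)) && !defined(__clang__)) && defined(__aarch64__) && defined(__ARM_NEON)

// After: architecture/feature checks only
#if defined(__aarch64__) && defined(__ARM_NEON)
    // NEON/i8mm implementation ...
#else
    // fall through to the generic implementation ...
#endif
```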
Co-authored-by: Diego Devesa <[email protected]>
Revert "ggml-cpu: arm64: q4_K repack gemm and gemv implementations (i8mm) (ggml-org#16739)"

This reverts commit dbb852b.
ggml-cpu: arm64: q4_K repack gemm and gemv implementations (i8mm) (ggml-org#16739)

* Enabled q4_K_8x8_q8_K path on ARM
* wip: I8mm qs multiplication, pending bias
* cpu : arm : REPACK gemm q4_K8x8 implementation
* Guard gemm with proper features, improved superblock scale and min calc
* cpu: arm: Implemented REPACK gemv for Q4_K
* Removed completed TODO
* Fixed missing guards when selecting optimal repack type for Q4_K
* Fixed macro guard for gemv
* Fixed wrong comment in GEMV
* Fixed warning for unused variable
* vdotq_s32 -> ggml_vdotq_s32
* Clang-format issues
* Apply suggestions from code review
* Removed unnecessary GGML_UNUSED
* Fixed guards in q4_k gemm and gemv (repack)

Signed-off-by: Alberto Cabrera <[email protected]>
Co-authored-by: Diego Devesa <[email protected]>
This PR improves q4_k_q8_k gemm and gemv on arm64 using i8mm and vecdot instructions.
Tested on an Apple M4 Max:
REPACK vs NO REPACK
REPACK: 8a2fd93 (7070)
NO REPACK: 45c6ef7 (7058)
Perplexity
As for test-backend-ops, I've checked the output of the layer tensors manually, comparing REPACK vs master, since #16182 is still ongoing.
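The i8mm path rests on the SMMLA instruction (exposed as the `vmmlaq_s32` intrinsic): each int8x16 operand is treated as a 2x8 matrix, and the 2x2 matrix of row dot products is accumulated into an int32 accumulator. A scalar sketch of those semantics, for reference; the real kernel applies this to repacked q4_K superblocks and then folds in the per-block scales and mins, which this sketch deliberately omits:

```cpp
#include <cstdint>

// Scalar model of the AArch64 SMMLA semantics behind vmmlaq_s32:
// acc[i][j] += dot(a row i, b row j) over 8 int8 lanes, so one
// instruction advances a 2x2 patch of the int32 output tile.
void smmla_ref(int32_t acc[2][2], const int8_t a[2][8], const int8_t b[2][8]) {
    for (int i = 0; i < 2; ++i) {           // rows of a
        for (int j = 0; j < 2; ++j) {       // rows of b (columns of output)
            for (int k = 0; k < 8; ++k) {
                acc[i][j] += (int32_t) a[i][k] * (int32_t) b[j][k];
            }
        }
    }
}
```

For example, with a's rows filled with 1 and 2 and b's rows filled with 3 and -1, the accumulated tile is {{24, -8}, {48, -16}} (each entry is 8 times the product of the two fill values).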